Sequence to Sequence Learning for Query Expansion
Using sequence-to-sequence algorithms for query expansion has not yet been
explored in the Information Retrieval literature, nor in Question Answering's.
We tried to fill this gap with a custom Query Expansion engine trained and
tested on open datasets. Starting from these datasets, we built a Query
Expansion training set using sentence-embedding-based keyword extraction. We
then assessed the ability of sequence-to-sequence neural networks to capture
expansion relations in the word-embedding space.
Comment: 8 pages, 2 figures, AAAI-19 Student Abstract and Poster Program
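The extraction step is only sketched in the abstract; below is a minimal, hypothetical illustration of building (query, expansion) training pairs via sentence-embedding-based keyword extraction, assuming the sentence-transformers library and an off-the-shelf encoder (our choices, not necessarily the paper's):

```python
# Hypothetical sketch: score candidate terms against a document embedding
# (KeyBERT-style) to build query-expansion training targets.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder, our choice

def extract_keywords(document: str, candidates: list[str], top_k: int = 5):
    """Rank candidate terms by cosine similarity to the document embedding."""
    doc_emb = model.encode([document])[0]
    cand_embs = model.encode(candidates)
    sims = cand_embs @ doc_emb / (
        np.linalg.norm(cand_embs, axis=1) * np.linalg.norm(doc_emb) + 1e-8
    )
    return [candidates[i] for i in np.argsort(-sims)[:top_k]]

# A seq2seq model can then be trained on pairs of the form
# (query, " ".join(extract_keywords(relevant_document, candidate_terms))).
```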
Big model only for hard audios: Sample dependent Whisper model selection for efficient inferences
Recent progress in Automatic Speech Recognition (ASR) has been coupled with a
substantial increase in model sizes, which may now contain billions of
parameters, leading to slow inference even with adapted hardware. In this
context, ASR models exist in various sizes, with different inference costs
and corresponding performance levels. Based on the observation that smaller
models perform well on large parts of testing corpora, we propose to train a
decision module that, given an audio sample, selects the smallest model
sufficient to produce a good transcription. We apply our approach to two
Whisper models of different sizes. By keeping the decision process
computationally efficient, we build a decision module that allows substantial
computational savings with limited performance drops.
Comment: Submitted to ICASSP 202
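The abstract does not describe the decision module itself; the sketch below shows one plausible shape for such a router, with invented cheap features and placeholder labels (an audio would be labeled 1 when the small model's transcript was already close to the large model's):

```python
# Hedged sketch of a sample-dependent router between two ASR model sizes.
# Features, labels, and model names are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

def cheap_features(audio: np.ndarray) -> np.ndarray:
    """Inexpensive stand-in features: energy and spectral centroid."""
    spec = np.abs(np.fft.rfft(audio))
    centroid = (spec * np.arange(len(spec))).sum() / (spec.sum() + 1e-8)
    return np.array([audio.std(), centroid])

# Placeholder training data: label 1 means "small model was sufficient".
X = np.random.randn(100, 2)
y = np.random.randint(0, 2, 100)
router = LogisticRegression().fit(X, y)

def select_model(audio: np.ndarray) -> str:
    small_ok = router.predict(cheap_features(audio)[None, :])[0]
    return "whisper-small" if small_ok else "whisper-large"
```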
Pretext Tasks selection for multitask self-supervised speech representation learning
Through solving pretext tasks, self-supervised learning leverages unlabeled
data to extract useful latent representations that replace traditional input
features in downstream tasks. In audio and speech signal processing, a wide
range of features were engineered through decades of research efforts.
Learning to predict such features (a.k.a. pseudo-labels) has proven to be a
particularly relevant pretext task, leading to self-supervised
representations that are effective on downstream tasks. However, methods and
common practices for combining such pretext tasks for better downstream
performance have not been properly explored and understood. In fact, the
process relies almost exclusively on a computationally heavy experimental
procedure, which becomes intractable as the number of pretext tasks grows.
This paper introduces a method to select a group of pretext tasks among a set
of candidates. The proposed method estimates calibrated weights for the
partial losses corresponding to the considered pretext tasks during the
self-supervised training process. Experiments conducted on automatic speech
recognition, speaker recognition, and emotion recognition validate our
approach: the groups selected and weighted with our method outperform classic
baselines, thus facilitating the selection and combination of relevant
pseudo-labels for self-supervised representation learning.
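For illustration, one standard way to combine partial pretext losses with learnable weights is the log-variance weighting of Kendall et al. (2018); this PyTorch sketch is a stand-in, not the paper's calibrated-weight estimator:

```python
# Minimal sketch: learnable weighting of per-pretext-task losses.
import torch

class WeightedPretextLoss(torch.nn.Module):
    def __init__(self, num_tasks: int):
        super().__init__()
        # one learnable log-variance per pretext task
        self.log_vars = torch.nn.Parameter(torch.zeros(num_tasks))

    def forward(self, partial_losses: list[torch.Tensor]) -> torch.Tensor:
        total = torch.zeros(())
        for i, loss in enumerate(partial_losses):
            precision = torch.exp(-self.log_vars[i])
            total = total + precision * loss + self.log_vars[i]
        return total

# usage: loss = criterion([pitch_loss, mfcc_loss, energy_loss]); loss.backward()
```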
Automatic Data Augmentation for Domain Adapted Fine-Tuning of Self-Supervised Speech Representations
Self-Supervised Learning (SSL) has made it possible to leverage large amounts
of unlabeled speech data to improve the performance of speech recognition
models, even with small annotated datasets. Despite this, speech SSL
representations may fail when facing an acoustic mismatch between the
pretraining and target datasets. To address this issue, we propose a novel
supervised domain adaptation method designed for cases exhibiting such a
mismatch in acoustic domains. It consists of applying properly calibrated
data augmentations to a large clean dataset, bringing it closer to the target
domain, and using it as part of an initial fine-tuning stage. Augmentations
are automatically selected by minimizing a conditional-dependence estimator
based on the target dataset. The approach is validated in an oracle
experiment with controlled distortions and on two amateur-collected
low-resource domains, outperforming the baselines in both cases.
Comment: 6 pages, INTERSPEECH 202
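The estimator itself is not given in the abstract; as a rough stand-in, the sketch below ranks candidate augmentations by an RBF-kernel MMD between augmented clean features and target-domain features, i.e., a distributional-distance proxy rather than the paper's conditional-dependence measure:

```python
# Illustrative sketch: pick the augmentation that brings clean-domain
# features closest to the target domain under an MMD score.
import numpy as np

def mmd_rbf(x: np.ndarray, y: np.ndarray, gamma: float = 1.0) -> float:
    """Biased RBF-kernel Maximum Mean Discrepancy between two samples."""
    def k(a, b):
        d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d)
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def pick_augmentation(clean_feats, target_feats, augmentations):
    """augmentations: dict mapping a name to a callable on feature arrays."""
    scores = {name: mmd_rbf(aug(clean_feats), target_feats)
              for name, aug in augmentations.items()}
    return min(scores, key=scores.get)  # smallest distance to the target
```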
Leveraging Data Collection and Unsupervised Learning for Code-switched Tunisian Arabic Automatic Speech Recognition
Crafting an effective Automatic Speech Recognition (ASR) solution for
dialects demands innovative approaches that not only address the data
scarcity issue but also navigate the intricacies of linguistic diversity. In
this paper, we address this ASR challenge, focusing on the Tunisian dialect.
First, textual and audio data are collected and, in some cases, annotated.
Second, we explore self-supervision, semi-supervision, and few-shot
code-switching approaches to push the state of the art on different Tunisian
test sets covering diverse acoustic, linguistic, and prosodic conditions.
Finally, given the absence of a conventional spelling, we conduct a human
evaluation of our transcripts to avoid the noise introduced by spelling
inconsistencies in our testing references. Our models, which can transcribe
audio samples in a linguistic mix of Tunisian Arabic, English, and French,
are released for public use and further improvement, along with all the data
used during training and testing.
Comment: 6 pages, submitted to ICASSP 202
Speech Self-Supervised Representation Benchmarking: Are We Doing it Right?
Self-supervised learning (SSL) has recently made it possible to leverage
large datasets of unlabeled speech signals to reach impressive performance on
speech tasks using only small amounts of annotated data. The high number of
proposed approaches has fostered the need for, and rise of, extended
benchmarks that evaluate performance on a set of downstream tasks exploring
various aspects of the speech signal. However, while the number of considered
tasks has been growing, most benchmarks rely on a single decoding
architecture that maps the frozen SSL representations to the downstream
labels. This work investigates the robustness of such benchmarking results to
changes in the decoder architecture. Interestingly, varying the architecture
of the downstream decoder leads to significant variations in the leaderboards
of most tasks. Concerningly, our study reveals that benchmarking with limited
decoders may cause a counterproductive increase in the sizes of the developed
SSL models.
Comment: 6 pages
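To make the setup concrete, here is a minimal PyTorch sketch of probing the same frozen SSL features with two different downstream decoders; dimensions and heads are placeholders rather than the paper's exact configurations:

```python
# Sketch: same frozen features, two trainable decoder heads.
import torch

feats = torch.randn(8, 200, 768)  # placeholder frozen SSL features (B, T, D)
num_classes = 40                  # placeholder downstream label count

class BiLSTMHead(torch.nn.Module):
    def __init__(self, dim: int, n_out: int):
        super().__init__()
        self.lstm = torch.nn.LSTM(dim, 256, bidirectional=True,
                                  batch_first=True)
        self.out = torch.nn.Linear(512, n_out)

    def forward(self, x):
        h, _ = self.lstm(x)
        return self.out(h)

for name, head in [("linear", torch.nn.Linear(768, num_classes)),
                   ("bilstm", BiLSTMHead(768, num_classes))]:
    logits = head(feats)  # only the head is trained; features stay frozen
    print(name, logits.shape)  # benchmark rankings may differ per head
```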
CL-MASR: A Continual Learning Benchmark for Multilingual ASR
Modern multilingual automatic speech recognition (ASR) systems like Whisper
have made it possible to transcribe audio in multiple languages with a single
model. However, current state-of-the-art ASR models are typically evaluated on
individual languages or in a multi-task setting, overlooking the challenge of
continually learning new languages. There is insufficient research on how to
add new languages without losing valuable information from previous data.
Furthermore, existing continual learning benchmarks focus mostly on vision and
language tasks, leaving continual learning for multilingual ASR largely
unexplored. To bridge this gap, we propose CL-MASR, a benchmark designed for
studying multilingual ASR in a continual learning setting. CL-MASR provides a
diverse set of continual learning methods implemented on top of large-scale
pretrained ASR models, along with common metrics to assess the effectiveness of
learning new languages while addressing the issue of catastrophic forgetting.
To the best of our knowledge, CL-MASR is the first continual learning benchmark
for the multilingual ASR task. The code is available at
https://github.com/speechbrain/benchmarks.
Comment: 16 pages, 5 figures, 5 tables
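As a small illustration of the kind of metrics such a benchmark tracks, the sketch below computes an average WER and a common forgetting measure from a hypothetical WER matrix (placeholder numbers and a generic metric definition, not CL-MASR's exact ones):

```python
# Sketch: continual-learning metrics from wer[i][j], the WER (%) on
# language j after training sequentially through languages 0..i.
import numpy as np

wer = np.array([[10., 90., 90.],   # placeholder results
                [14., 12., 90.],
                [18., 16., 11.]])

avg_wer = wer[-1].mean()  # performance across languages after the last task
# Forgetting: degradation of each earlier language from its best point.
forgetting = np.mean([wer[-1, j] - wer[:, j].min()
                      for j in range(wer.shape[1] - 1)])
print(f"average WER {avg_wer:.1f}%, forgetting {forgetting:.1f}%")
```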